Team Illinois Members


Introduction

The purpose of this project is to attempt to build models which can be used to accurately predict premature death based on various health and socio-economic factors. The data for this project were obtained from https://www.countyhealthrankings.org/. County Health Rankings ranks each county within each state based on a variety of health measures obtained from national and state data sources. One of the many attributes contained in the raw data is the “Years of Potential Life Lost Rate” (YPLLRate) which represents premature death for the a given geographical region. The YPLLRate value is the number of years lost per 100,000 people based on the various health and socio-economic factors present in the data. This project seeks to model the interaction between the YPLLRate serving as the response variable and the other attributes serving as the predictor variables.

The raw data for this project were obtained from https://www.countyhealthrankings.org/. County Health Rankings obtained this data from various nation-wide and state data sources. The raw data consisted of over 240 variables and the preliminary cleaning reduced this number to 96 by removing redundancies and sparsely populated variables.

The redundant variables included many examples of providing a quantity and a percentage of some factor. For example, there were variables for the number of smokers in a region as well as the percentage of the population that smoked. The percentage variables were chosen over the quantity variables and the quantity variable were omitted There were also many race-specific variables that had N/A values in too many observations to be useful as predictors. These variables were omitted as well.

The remaining predictor variables are made up of categorical geographic data such as state and county, health related numeric variables such as percentage of low birth weight, percentage of smokers, and percentage of adults with diabetes, and socio-economic numeric variables such as percentage of single parent households, percentage of people with some college, and percentage of people unemployed.

The objective of this project is to identify which combination of the remaining predictors most-contributes to premature death.

Methods

The large number of predictors in the cleaned data set made model selection very difficult. No single approach seemed to emerge as the best route so we opted to pursue several approaches and compare the results. Model-selection algorithm proved to be computationally infeasible in that it ran for exceptionally long times and the inclusion of the State and Region categorical variables made the number of permutations far too high to be useful. Our three approaches are similar in some ways and different in others as detailed below.

Approach 1

This approach involved attempting to manually select a smaller set of predictors based on their textual descriptions and their individual linear relationships to the response variable.

data_1 = read_csv("data/model_1_data.csv")
pairs(data_1)

Below are plots of the relationships between the response and several predictors.

Transforming predictor variables had no effect on model improvement, but performing a log transformation on the YPLLRate response variable did show improvement. The model was selected by a forward AIC selection with the scope being set to the full interactive model of the chosen predictors.

empty_model = lm(log(YPLLRate) ~ 1, data = data_1)

selected_model = step(empty_model, scope = log(YPLLRate) ~ HealthBadDaysMentAvg * HospPrevRate * 
                                                           Income20thpcntl * InjuryDeathRate * 
                                                           LBWPct * LifeExpect * 
                                                           MammAnnualPct * SmokersPct *
                                                           UnemployedPct * UninsAdultPct * 
                                                           VaccinatedPct, 
                                                           direction = "forward", trace = 0)

influential_points = which(cooks.distance(selected_model) > 4 / length(cooks.distance(selected_model)))
outliers = as.vector(as.integer(names(rstandard(selected_model)[abs(rstandard(selected_model)) > 2])))
noise = c(outliers, influential_points)

new_data_1 = data_1[-noise, ]
new_empty_model = lm(log(YPLLRate) ~ 1, data = new_data_1)

final_model_1 = step(new_empty_model, scope = log(YPLLRate) ~ HealthBadDaysMentAvg * HospPrevRate * 
                                                              Income20thpcntl * InjuryDeathRate * 
                                                              LBWPct * LifeExpect * 
                                                              MammAnnualPct * SmokersPct * 
                                                              UnemployedPct *
                                                              UninsAdultPct *  VaccinatedPct, 
                                                              direction = "forward", trace = 0)

Approach 2

In this method, we started by selecting a small number of predictors that showed a strong connection to the response variable “Years of Potential Life Lost Rate.” This selection was done based on the study of the variables from the data source https://www.countyhealthrankings.org/explore-health-rankings/rankings-data-documentation

This set of variables were added to smaller file to make the loading and knitting of the markdown document more efficient.

model_2_DF = read_csv("data//model_2_data.csv") 

model_2_DF$`WaterViol` = as.factor(model_2_DF$`WaterViol`)

model_2_DF = model_2_DF[sample(1:nrow(model_2_DF), 2000, replace=FALSE),]

# Split into test and train
indexes = sample(nrow(model_2_DF), 1000)
train_df = model_2_DF[indexes,]
test_df = model_2_DF[-indexes,]
  • Through a pairs plot of the variables we can see that some variables appear to have collinearity issues.
  • There are others that may benefit from transformations. We will explore this further.

  • Various transformations were applied to the predictor variables and only one of them showed a marked difference from the transformation. But this variable was dropped later in the final model derived in this section.

  • We started by fitting a model with all predictors. Fitted vs Residuals plot of this model shows outliers that could be influencing the model. In the next step we removed the ouliers and can see from the Fitted vs Residuals and Q-Qplot that it greatly improved the variance distribution.
# FULL model, with all predictors
model_2_full = lm(YPLLRate ~ . , data = model_2_DF)
# Check for Outliers and Influence.
outliers = as.vector(as.integer(names(rstandard(model_2_full)[abs(rstandard(model_2_full)) > 2])))
high_influence = as.vector(which(cooks.distance(model_2_full) > 4 / length(cooks.distance(model_2_full))))

# Remove outliers
remove_noise = c(outliers, high_influence)
model_2_DF_clean = model_2_DF[-remove_noise,]

model_2_full_cl = lm(YPLLRate ~ . , data = model_2_DF_clean)
(model_2_aic = step(model_2_full_cl, direction = "backward", trace=0))
## 
## Call:
## lm(formula = YPLLRate ~ LBWPct + ObeseAdultsPct + DrinkExcPct + 
##     HSGradRate + DistressFreqPhysPct + HIVRate + IncHousehldMedian + 
##     HomeownersPct, data = model_2_DF_clean)
## 
## Coefficients:
##         (Intercept)               LBWPct       ObeseAdultsPct  
##           2010.2298             297.3129              38.5277  
##         DrinkExcPct           HSGradRate  DistressFreqPhysPct  
##           -116.4524              15.0338             301.7219  
##             HIVRate    IncHousehldMedian        HomeownersPct  
##              0.3694              -0.0459              29.4965

We tried several approaches to improve this model but the best model we could achieve was by applying box-cox transform to the response variable of a model selected via backward-AIC. model

boxcox(model_2_aic, lambda = seq(-0.25, 0.75, by = 0.05), plotit = TRUE)

model_2_cox = lm(formula = (((YPLLRate ^ 0.6) - 1) / 0.6) ~ LBWPct + ObeseAdultsPct + DrinkExcPct +  HSGradRate + DistressFreqPhysPct + HIVRate + IncHousehldMedian +  HomeownersPct, data = model_2_DF_clean)

Plots

Approach 3

Data Cleaning:

We started with a script that merged two sheets from the original data. The team chose to keep variables that contained rates and to drop race demographics. We started with 92 variables and 3141 observations. For this approach, we then dropped columns with too many NA’s (more than our result YPLLRate) and then omitted observations containing ‘NA’. This left us with 71 variables and 2333 observations. Lastly, several variables were coerced to factors and short names were added.

Test Statistics used:

adjusted R-squared is a modified version of R-squared that has been adjusted for the number of predictors in the model. The adjusted R-squared increases only if the new term improves the model more than would be expected by chance. It decreases when a predictor improves the model by less than expected by chance. It indicates the percent of the variation in the output variables that are explained by the input variables.

LOOCV RMSE: An observation is removed and the model is fit the the remaining data and this fit used to predict the value of the deleted observation. This is repeated, n times, for each of the n observations and the mean square error is computed.

Standard Deviation: We divided LOOCV RMSE by standard deviation in order to normalize RMSE. This allowed us to compare RMSE between models, including models with transformed results. All three models applied a different transformation to their result.

Correlation Coefficients were used to assess how highly correlated predictors are with the result. They were also used to consider which predictors may be interacting.

Partial Correlation Coefficients The partial correlation coefficient was used in predictor selection in addition to correlation because it it controls for confounding variables in a model.

Shapiro-Wilk Test: Shapiro test assesses the normality of errors. The null hypothesis assumes data were sampled from a normal distribution. A small p-value rejects the null hypothesis.

Breusch-Pagan-Godfrey Test: The null hypothesis for the bptest is constant variance (homoscedasticity). As always if the p-value is below a threshold (\(\alpha = 0.05\)) then heteroscedasticity is assumed. Heteroscedasticity occurs when the variance of the error terms differ across observations. The BP test statistic shows how big the problem is. Smaller is better.

Assessment of Model 3:

Model 3 has a good adjusted R-squared value of 0.972. LOOCV RMSE is 17.5 and when normalized with the standard deviation the normalized RMSE is 0.16. This is a pretty good value. When the data was split into test and train and run 1000 times, the normalized RMSE was 0.17, confirming our LOOCV RMSE result. The transformation with BoxCox lambda greatly improved the Q-Q plot and the tails were minimal. The Shapiro test had a high P-value which support the null hypothesis of normality of errors. The remaining problem is it fails the BPtest with BP value of 131 and p-value very low. The model still suffers from heteroscedasticity but does well on the other measures. BP value was much improved from over 300 before the BoxCox transform and removal of influential observations. Model 3 uses 23 predictors out of 93 possible and they are representative of the major contributors to health (environment, income, healthcare, demographics). Given the high correlation between many of the predictors and the high variance of the response, this is a decent model.

Residuals Plot

Q-Q Plot

Results

The three models are shown below:

Model_1

formula(final_model_1)
## log(YPLLRate) ~ LifeExpect + InjuryDeathRate + LBWPct + UninsAdultPct + 
##     UnemployedPct + Income20thpcntl + MammAnnualPct + LifeExpect:InjuryDeathRate + 
##     LifeExpect:LBWPct + LBWPct:Income20thpcntl + LifeExpect:UninsAdultPct + 
##     InjuryDeathRate:UninsAdultPct + UninsAdultPct:MammAnnualPct + 
##     UninsAdultPct:UnemployedPct + LifeExpect:UnemployedPct + 
##     InjuryDeathRate:UnemployedPct + LifeExpect:Income20thpcntl + 
##     InjuryDeathRate:Income20thpcntl + InjuryDeathRate:MammAnnualPct + 
##     LBWPct:MammAnnualPct + InjuryDeathRate:LBWPct + LifeExpect:UninsAdultPct:UnemployedPct + 
##     LifeExpect:InjuryDeathRate:UninsAdultPct + LifeExpect:LBWPct:Income20thpcntl

Model_2

formula(model_2_cox)
## (((YPLLRate^0.6) - 1)/0.6) ~ LBWPct + ObeseAdultsPct + DrinkExcPct + 
##     HSGradRate + DistressFreqPhysPct + HIVRate + IncHousehldMedian + 
##     HomeownersPct

Model_3

formula(model_3)
## YPLLRate^lmda_mdl3 ~ HealthPoorPct + FoodEnvirIX + PhysInactPct + 
##     PhysPrimCareRate + MammAnnualPct + Income20thpcntl + IncomeRatio + 
##     SocsAssRate + InjuryDeathRate + HousSvrProbPct + HousInadFacil + 
##     LifeExpect + DeathAgeAdjRate + VaccinatedPct + IncHousehldMedian + 
##     PopGE65Pct + PopNativePct + PopPacificPct + EnglishNotProfPct + 
##     PopRuralPct + PopGE65Pct:DeathAgeAdjRate + HealthPoorPct:PopFemalePct + 
##     LBWPct:IncHousehldMedian

Model Diagnostics

Model 1 Model 2 Model 3
BP Test 0.0000 0.0000 0.0000
Shapiro 0.0746 0.0156 0.3390
No params 25.0000 9.0000 24.0000
LOOV RMSE 0.0535 27.1164 17.5007
Adj R2 0.9618 0.7944 0.9717
RMSE 0.0529 26.9918 17.3025
AIC -12514.0682 12264.2048 12397.8798
Normalized RMSE 0.1835 0.4230 0.1580

Discussion

Appendix